Chinese-English Statistical Machine Translation by Parsing
Abstract
Statistical machine translation (SMT) has evolved from word-based models to higher levels of abstraction. The best-known current systems are phrase-based, and recent research has begun to explore tree-based systems that use syntactic information. This thesis studies large-scale Chinese-English SMT using a syntactic tree-based model. From an engineering point of view, SMT systems are very complex to build, but existing software can bring the workload down to a level manageable for this thesis. Using the GenPar framework and other software, this thesis studies Chinese-English SMT by parsing through large-scale experiments. This is the first application of GenPar to Chinese-English SMT. The experiments show that the accuracy of Chinese-English SMT by parsing is comparable to that of existing SMT by parsing for other language pairs. However, the accuracy of current MT methods is still well below that of human translation, and it is influenced by differences between training and testing data, such as writing style and domain. Two important factors in the SMT by parsing model are studied: it is observed that although the accuracy of word-to-word alignment influences translation accuracy, the monolingual English and Chinese grammars do not have a significant impact on the results. From these observations, the advantages and weaknesses of the SMT model are analysed, and possible future improvements for Chinese-English SMT are suggested. This thesis is organised in three main parts. The first chapter presents the introduction and an overview of the thesis. The second and third chapters review the literature on the related theory, giving a detailed exposition of SMT and SMT by parsing. The last two chapters report the novel experiments on Chinese-English SMT by generalised parsing. Drawing on a discussion of the experimental output, the last chapter summarises the thesis and proposes further work.
Similar resources
Machine Translation Through Clausal Syntax : A Statistical Approach for Chinese to English by Dan Lowe Wheeler
Language pairs such as Chinese and English with largely differing word order have proved to be one of the greatest challenges in statistical machine translation. One reason is that such techniques usually work with sentences as flat strings of words, rather than explicitly attempting to parse any sort of hierarchical structural representation. Because even simple syntactic differences between l...
A Hybrid System for Chinese-English Patent Machine Translation
This paper presents a novel hybrid system, which combines rule-based machine translation (RBMT) with phrase-based statistical machine translation (SMT), to translate Chinese patent texts into English. The hybrid architecture is basically guided by the RBMT engine which processes source language parsing and transformation, generating proper syntactic trees for the target language. In the generat...
Combining Linguistics and Statistics for High-Quality Limited Domain English-Chinese Machine Translation
Second language learning is a compelling activity in today’s global markets. This thesis focuses on critical technology necessary to produce a computer spoken translation game for learning Mandarin Chinese in a relatively broad travel domain. Three main aspects are addressed: efficient Chinese parsing, high-quality English-Chinese machine translation, and how these technologies can be integrate...
Cross Language Dependency Parsing using a Bilingual Lexicon
This paper proposes an approach to enhance dependency parsing in a language by using a translated treebank from another language. A simple statistical machine translation method, word-by-word decoding, where not a parallel corpus but a bilingual lexicon is necessary, is adopted for the treebank translation. Using an ensemble method, the key information extracted from word pairs with dependency ...
Dependency-based Pre-ordering for Chinese-English Machine Translation
In statistical machine translation (SMT), syntax-based pre-ordering of the source language is an effective method for dealing with language pairs where there are great differences in their respective word orders. This paper introduces a novel pre-ordering approach based on dependency parsing for Chinese-English SMT. We present a set of dependency-based preordering rules which improved the BLEU ...